Abstract

Most accurate predictions are typically obtained by learning machines with complex
feature spaces (as e.g. induced by kernels). Unfortunately, such decision rules are hardly
accessible to humans and cannot easily be used to gain insights about the application
domain. Therefore, one often resorts to linear models in combination with variable
selection, thereby sacrificing some predictive power for presumptive interpretability. Here,
we introduce the Feature Importance Ranking Measure (FIRM), which by retrospective
analysis of arbitrary learning machines allows one to achieve both excellent predictive
performance and superior interpretation. In contrast to standard raw feature weighting, FIRM
takes the underlying correlation structure of the features into account. Thereby, it is able
to discover the most relevant features, even if their appearance in the training data is
entirely prevented by noise. The desirable properties of FIRM are investigated analytically
and illustrated in simulations.

1 Introduction 

A major goal of machine learning — beyond providing accurate predictions — is to gain 
understanding of the investigated problem. In particular, for researchers in application areas, 
it is frequently of high interest to unveil which features are indicative of certain predictions.

Existing approaches to the identification of important features can be categorized according 
to the restrictions that they impose on the learning machines. 

The most convenient access to features is granted by linear learning machines. In this 
work we consider methods that express their predictions via a real-valued output function 
$s : \mathcal{X} \to \mathbb{R}$, where $\mathcal{X}$ is the space of inputs. This includes standard
models for classification, regression, and ranking. Linearity thus amounts to

$$ s(x) = w^\top x + b . \tag{1} $$

One popular approach to finding important dimensions of vectorial inputs ($\mathcal{X} = \mathbb{R}^d$)
is feature selection, by which the training process is tuned to make sparse use of the available $d$
candidate features. Examples include $\ell_1$-regularized methods like Lasso [E] or $\ell_1$-SVMs P
and heuristics for non-convex $\ell_0$-regularized formulations. They all find feature weightings $w$
that have few non-zero components, for example by eliminating redundant dimensions. Thus, 
although the resulting predictors are economical in the sense of requiring few measurements, 
it cannot be concluded that the other dimensions are unimportant: a different (possibly even
disjoint) subset of features may yield the same predictive accuracy. Being selective among
correlated features also predisposes feature selection methods to be unstable. Last but not
least, the accuracy of a predictor is often decreased by enforcing sparsity (see e.g. [10]).

In multiple kernel learning (MKL; e.g. [3 [TU]) a sparse linear combination of a small set 
of kernels [S is optimized concomitantly to training the kernel machine. In essence, this 
lifts both merits and detriments of the selection of individual features to the coarser level of 
feature spaces (as induced by the kernels). MKL thus fails to provide a principled solution 
to assessing the importance of sets of features, let alone individual features. It is by now
widely recognized that $\ell_1$-regularized MKL rarely even matches the accuracy of a plain
uniform kernel combination [2].

Alternatively, the sparsity requirement may be dropped, and the j-th component wj of 
the trained weights w may be taken as the importance of the j-th input dimension. This 
has been done, for instance, in cognitive sciences to understand the differences in human 
perception of pictures showing male and female faces ^ ; here the resulting weight vector w 
is relatively easy to understand for humans since it can be represented as an image. 

Again, this approach may be partially extended to kernel machines [8], which do not access
the features explicitly. Instead, they yield a kernel expansion

$$ s(x) = \sum_{i=1}^{n} \alpha_i k(x_i, x) + b , \tag{2} $$

where $(x_i)_{i=1,\dots,n}$ are the inputs of the $n$ training examples. Thus, the weighting
$\alpha \in \mathbb{R}^n$ corresponds to the training examples and cannot be used directly for the
interpretation of features. It may still be viable to compute explicit weights for the features
$\Phi(x)$ induced by the kernel via $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$, provided that the
kernel is benign: it must be guaranteed that only a finite and limited number of features are used
by the trained machine, such that the equivalent linear formulation with

$$ w = \sum_{i=1}^{n} \alpha_i \Phi(x_i) $$

can efficiently be deduced and represented.
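
As an illustration of this unfolding, the following minimal sketch uses a homogeneous quadratic
kernel, whose explicit feature map is small enough to write down. The kernel choice, the function
names and the toy numbers are assumptions made for illustration only; they are not taken from the
experiments in this paper.

```python
from itertools import product


def quad_kernel(x, z):
    """Homogeneous quadratic kernel k(x, z) = (x . z)^2 (assumed example kernel)."""
    return sum(a * b for a, b in zip(x, z)) ** 2


def quad_features(x):
    """Explicit feature map Phi(x): all ordered products x_a * x_b."""
    return [x[a] * x[b] for a, b in product(range(len(x)), repeat=2)]


def unfold_weights(alphas, train_inputs):
    """Explicit weight vector w = sum_i alpha_i * Phi(x_i)."""
    w = [0.0] * len(quad_features(train_inputs[0]))
    for alpha, x_i in zip(alphas, train_inputs):
        for k, phi_k in enumerate(quad_features(x_i)):
            w[k] += alpha * phi_k
    return w


def score_kernel(x, alphas, train_inputs, b):
    """Kernel expansion s(x) = sum_i alpha_i k(x_i, x) + b."""
    return sum(a * quad_kernel(x_i, x) for a, x_i in zip(alphas, train_inputs)) + b


def score_linear(x, w, b):
    """Equivalent linear form s(x) = w . Phi(x) + b."""
    return sum(w_k * phi_k for w_k, phi_k in zip(w, quad_features(x))) + b


# Hypothetical toy expansion coefficients and training inputs.
train_inputs = [(1.0, 2.0), (0.5, -1.0), (-2.0, 0.0)]
alphas = [0.3, -0.7, 0.2]
b = 0.1
w = unfold_weights(alphas, train_inputs)
x_new = (1.5, -0.5)
print(score_kernel(x_new, alphas, train_inputs, b), score_linear(x_new, w, b))
```

Both printed scores agree up to rounding, because $(x^\top z)^2 = \langle \Phi(x), \Phi(z) \rangle$ for
this feature map. For kernels with infinite-dimensional feature spaces no such finite $w$ exists.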

A generalization of the feature weighting approach that works with general kernels has
been proposed by Üstün et al. [14]. The idea is to characterize input variables by their
correlation with the weight vector $\alpha$. For a linear machine as given by (1) this directly results
in the weight vector $w$; for non-linear functions $s$, it yields a projection of $w$, the meaning of
which is less clear.

A problem that all above methods share is that the weight that a feature is assigned by a 
learning machine is not necessarily an appropriate measure of its importance. For example, by 
multiplying any dimension of the inputs by a positive scalar and dividing the associated weight 
by the same scalar, the conjectured importance of the corresponding feature can be changed 
arbitrarily, although the predictions are not altered at all, i.e. the trained learning machine 
is unchanged. An even more practically detrimental shortcoming of feature weighting
is its failure to take into account correlations between features; this will be illustrated in a
computational experiment below (Section 3).
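
The rescaling argument can be checked numerically with a few lines. This is only a sketch with
made-up numbers; `linear_score` and `rescale` are hypothetical helper names.

```python
def linear_score(w, x, b=0.0):
    """Linear scoring function s(x) = w . x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def rescale(w, x, j, c):
    """Multiply input dimension j by c > 0 and divide its weight by c."""
    w2, x2 = list(w), list(x)
    w2[j] = w[j] / c
    x2[j] = x[j] * c
    return w2, x2


w = [2.0, -1.0, 0.5]   # hypothetical trained weights
x = [0.3, 1.2, -0.7]   # hypothetical input
w_scaled, x_scaled = rescale(w, x, j=0, c=1000.0)

# The raw weight of feature 0 shrinks by a factor of 1000 ...
print(w[0], w_scaled[0])
# ... while the prediction is the same up to rounding.
print(linear_score(w, x), linear_score(w_scaled, x_scaled))
```

The weight of the first feature can thus be made arbitrarily small or large without changing a
single prediction, which is why raw weights alone are not a reliable importance measure.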

Further, all methods discussed so far are restricted to linear scoring functions or kernel 
expansions. There also exists a range of customized importance measures that are used for 
building decision trees and random forests (see e.g. |1H [T2] for an overview). 

In this paper, we reach for an importance measure that is "universal" : it shall be applicable 
to any learning machine, so that we can avoid the clumsiness of assessing the relevance of 
features for methods that produce suboptimal predictions, and it shall work for any feature. 
We further demand that the importance measure be "objective", which has several aspects: it 
may not arbitrarily choose from correlated features as feature selection does, and it may not 
be prone to misguidance by feature rescaling as the weighting-based methods are. Finally, the 
importance measure shall be "intelligent" in that it exploits the connections between related 
features (this will become clearer below). 

In the following subsections, we briefly review the state of the art with respect to these goals
and in particular outline a recent proposal, which is, however, restricted to sequence data.
Section 2 shows how we generalize that idea to continuous features and establishes its desirable
properties; it also works out the math for several scenarios.
Finally, we present a few computational results illustrating the properties of our approach in
the different settings. The relevant notation is summarized in Table 1.



symbol                       definition                                                  reference

$\mathcal{X}$                input space
$s(x)$                       scoring function $\mathcal{X} \to \mathbb{R}$
$w$                          weight vector of a linear scoring function $s$              equation (1)
$f$                          feature function $\mathcal{X} \to \mathbb{R}$               equation (6)
$q_f(t)$                     conditional expected score $\mathbb{R} \to \mathbb{R}$      definition 1
$Q_f$                        feature importance ranking measure (FIRM) $\in \mathbb{R}$  definition 2
$Q$                          vector $\in \mathbb{R}^d$ of FIRMs for $d$ features         subsection 2.4
$\Sigma$, $\Sigma_{\bullet j}$   covariance matrix, and its $j$th column

Table 1: Notation



1.1 Related Work 

A few existing feature importance measures satisfy one or more of the above criteria. One 
popular "objective" approach is to assess the importance of a variable by measuring the 
decrease of accuracy when retraining the model based on a random permutation of a variable. 
However, it has only a narrow application range, as it is computationally expensive and 
confined to input variables. 

Another approach is to measure the importance of a feature in terms of a sensitivity
analysis [3]:

$$ \mathbb{E}\left[ \frac{\partial s}{\partial x_j} \right] \cdot \mathrm{Var}[X_j]^{1/2} . \tag{3} $$

This is both "universal" and "objective". However, it clearly does not take the indirect effects
into account: for example, the change of $X_j$ may imply a change of some $X_k$ (e.g. due to
correlation), which may also impact $s$ and thereby augment or diminish the net effect.

Here we follow the related but more "intelligent" idea of [17]: to assess the importance of a
feature by estimating its total impact on the score of a trained predictor. While [17] proposes
this for binary features that arise in the context of sequence analysis, the purpose of this paper
is to generalize it to real-valued features and to theoretically investigate some properties of
this approach. It turns out (proof in Section 2.2) that under normality assumptions of the
input features, FIRM generalizes (3), as the latter is a first order approximation of FIRM,
and because FIRM also takes the correlation structure into account.

In contrast to the above mentioned approaches, the proposed feature importance ranking 
measure (FIRM) also takes the dependency of the input features into account. Thereby it is 
even possible to assess the importance of features that are not observed in the training data, 
or of features that are not directly considered by the learning machine. 



1.2 Positional Oligomer Importance Matrices [17]

In [17], a novel feature importance measure called Positional Oligomer Importance Matrices
(POIMs) is proposed for substring features in string classification. Given an alphabet $\Sigma$, for
example the DNA nucleotides $\Sigma = \{A, C, G, T\}$, let $x \in \Sigma^L$ be a sequence of length $L$.
The kernels considered in [17] induce a feature space that consists of one binary dimension for
each possible substring $y$ (up to a given maximum length) at each possible position $i$. The
corresponding weight $w_{y,i}$ is added to the score if the substring $y$ is incident at position $i$ in
$x$. Thus we have the case of a kernel expansion that can be unfolded into a linear scoring
system:

$$ s(x) = \sum_{y,i} w_{y,i} \, \mathbb{I}\{x[i] = y\} , \tag{4} $$

where $\mathbb{I}\{\cdot\}$ is the indicator function. Now POIMs are defined by

$$ Q'(z, j) := \mathbb{E}[\, s(X) \mid X[j] = z \,] - \mathbb{E}[s(X)] , \tag{5} $$

where the expectations are taken with respect to a $D$-th order Markov distribution.

Intuitively, $Q'$ measures how a feature, here the incidence of substring $z$ at position $j$,
would change the score $s$ as compared to the average case (the unconditional expectation).
Although positional sub-sequence incidences are binary features (they are either present or
not), they possess a very particular correlation structure, which can dramatically aid in the
identification of relevant features.
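
To make the definition concrete, the sketch below evaluates a POIM entry by brute-force
enumeration for very short sequences under a uniform 0-th order model. The toy weights and the
function names are assumptions for illustration; [17] uses efficient recursions rather than
enumeration.

```python
from itertools import product

ALPHABET = "ACGT"


def score(x, weights):
    """Linear substring score: add w[(y, i)] whenever substring y occurs at position i."""
    return sum(w for (y, i), w in weights.items() if x[i:i + len(y)] == y)


def poim(z, j, weights, length):
    """E[s(X) | X[j] = z] - E[s(X)] under a uniform 0-th order model, by enumeration."""
    all_seqs = ["".join(s) for s in product(ALPHABET, repeat=length)]
    matching = [x for x in all_seqs if x[j:j + len(z)] == z]
    mean_all = sum(score(x, weights) for x in all_seqs) / len(all_seqs)
    mean_cond = sum(score(x, weights) for x in matching) / len(matching)
    return mean_cond - mean_all


# Hypothetical scoring system that only knows two overlapping 2-mers.
weights = {("GA", 0): 1.0, ("AT", 1): 1.0}
q_gat = poim("GAT", 0, weights, length=4)  # 3-mer that has no weight of its own
q_g = poim("G", 0, weights, length=4)
print(q_gat, q_g)
```

Although the 3-mer GAT has no weight of its own in this toy model, it receives a large value
because its presence implies both weighted 2-mers. This is the kind of correlation structure
referred to above.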



2 The Feature Importance Ranking Measure (FIRM) 

As explained in the introduction, a trained learner is defined by its output or scoring function
$s : \mathcal{X} \to \mathbb{R}$. The goal is to quantify how important any given feature

$$ f : \mathcal{X} \to \mathbb{R} \tag{6} $$

of the input data is to the score. In the case of vectorial inputs $\mathcal{X} = \mathbb{R}^d$, examples
for features are simple coordinate projections $f_j(x) = x_j$, pairs $f_{jk}(x) = x_j x_k$ or higher order
interaction features, or step functions $f_{j,\tau}(x) = \mathbb{I}\{x_j > \tau\}$ (where $\mathbb{I}\{\cdot\}$ is
the indicator function).

We proceed in two steps. First, we define the expected output of the score function under
the condition that the feature $f$ attains a certain value.



Definition 1 (conditional expected score). The conditional expected score of $s$ for a feature
$f$ is the expected score $q_f : \mathbb{R} \to \mathbb{R}$ conditional to the feature value $t$ of the feature $f$:

$$ q_f(t) = \mathbb{E}[\, s(X) \mid f(X) = t \,] . \tag{7} $$

We remark that this definition corresponds, up to normalization, to the marginal
variable importance studied by van der Laan [15]. A flat function $q_f$ corresponds to a feature
$f$ that has no or just random effect on the score; a variable function $q_f$ indicates an important
feature $f$.

Consequently, the second step of FIRM is to determine the importance of a feature $f$ as
the variability of the corresponding expected score $q_f : \mathbb{R} \to \mathbb{R}$.

Definition 2 (feature importance ranking measure). The feature importance $Q_f \in \mathbb{R}$ of the
feature $f$ is the standard deviation of the function $q_f$:

$$ Q_f := \sqrt{\mathrm{Var}[q_f(f(X))]} = \left( \int_{\mathbb{R}} \left( q_f(t) - \bar{q}_f \right)^2 \Pr(f(X) = t) \, dt \right)^{1/2} , \tag{8} $$

where $\bar{q}_f := \mathbb{E}[q_f(f(X))] = \int_{\mathbb{R}} q_f(t) \Pr(f(X) = t) \, dt$ is the expectation of $q_f$.

In case of (i) known linear dependence of the score on the feature under investigation or
(ii) an ill-posed estimation problem (8), for instance due to scarce data, we suggest
replacing the standard deviation by the more reliably estimated slope of a linear regression. As
we will show later (Section 2.3), for binary features identical feature importances are obtained
both ways anyway.
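
For a feature that takes only a few distinct values, both definitions can be evaluated directly on
a sample by replacing the expectations with empirical means. The following is a minimal sketch
under that assumption; the function names and the toy data are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def conditional_expected_score(scores, feature_values):
    """Definition 1 on a sample: mean score and frequency for each observed feature value."""
    sums, counts = defaultdict(float), defaultdict(int)
    for s, t in zip(scores, feature_values):
        sums[t] += s
        counts[t] += 1
    q = {t: sums[t] / counts[t] for t in sums}
    p = {t: counts[t] / len(scores) for t in counts}
    return q, p


def firm(scores, feature_values):
    """Definition 2 on a sample: standard deviation of q_f(f(X))."""
    q, p = conditional_expected_score(scores, feature_values)
    q_bar = sum(p[t] * q[t] for t in q)
    return sqrt(sum(p[t] * (q[t] - q_bar) ** 2 for t in q))


scores = [1.0, 3.0, 5.0, 7.0]   # hypothetical scores s(x_i)
informative = [0, 0, 1, 1]      # q(0) = 2, q(1) = 6
flat = [0, 1, 1, 0]             # q(0) = q(1) = 4
print(firm(scores, informative), firm(scores, flat))
```

The first feature separates low from high scores and gets importance 2, while the second has a
flat conditional expected score and gets importance 0. For continuous features this plug-in
estimate would need binning or smoothing, which is where the estimation problem mentioned above
arises.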



2.1 Properties of FIRM 



FIRM generalizes POIMs. As we will show in Section 2.3, FIRM contains POIMs as a special
case (up to a scaling factor). POIMs, as defined in (5), are only meaningful for binary features.
FIRM extends the core idea of POIMs to continuous features.



FIRM is "universal". Note that our feature importance ranking measure (FIRM) can
be applied to a very broad family of learning machines. For instance, it works in
classification, regression, and ranking settings alike, as long as the task is modeled via a
real-valued output function over the data points. Further, it is not constrained to linear
functions, as is the case for $\ell_1$-based feature selection. FIRM can be used with any feature
space, be it induced by a kernel or not. The importance computation is not even confined to
features that are used in the output function. For example, one may train a kernel machine with a
polynomial kernel of some degree and afterwards determine the importance of polynomial
features of higher degree. We illustrate the ability of FIRM to quantify the importance of
unobserved features in Section 3.3.
FIRM is robust and "objective". In order to be sensible, an importance measure is
required to be robust with respect to perturbations of the problem and invariant with
respect to irrelevant transformations. Many successful methods for classification and regression
are translation-invariant; FIRM will immediately inherit this property. Below we show that
FIRM is also invariant to rescaling of the features in some analytically tractable cases
(including all binary features), suggesting that FIRM is generally well-behaved in this respect.
In Section 2.4.3 we show that FIRM is even robust with respect to the choice of the learning
method. FIRM is sensitive to rescaling of the scoring function $s$. In order to compare
different learning machines with respect to FIRM, $s$ should be standardized to unit variance; this
yields importances $\tilde{Q}_f = Q_f / \mathrm{Var}[s(X)]^{1/2}$ that are to scale. Note, however, that the
relative importance, and thus the ranking, of all features for any single predictor remains fixed.

Computation of FIRM. It follows from the definition of FIRM that we need to assess
the distribution of the input features and that we have to compute conditional distributions
of nonlinear transformations (in terms of the score function $s$). In general, this is infeasible.
While in principle one could try to estimate all quantities empirically, this leads to an
estimation problem due to the limited amount of data. However, in two scenarios, this becomes
feasible. First, one can impose additional assumptions. As we show below, for normally
distributed inputs and linear features, FIRM can be approximated analytically, and we only
need the covariance structure of the inputs. Furthermore, for linear scoring functions (1), we
can compute FIRM for (a) normally distributed inputs, (b) binary data with known covariance
structure, and (c) sequence data with (higher-order) Markov distribution, as shown before
in [17]. Second, one can approximate the conditional expected score $q_f$ by a
linear function and then estimate the feature importance $Q_f$ from its slope. As we show
in Section 2.3, this approximation is exact for binary data.



2.2 Approximate FIRM for Normally Distributed Features 

For general score functions $s$ and arbitrary distributions of the input, the computation of
the conditional expected score (7) and the FIRM score (8) is in general intractable, and
the quantities can at best be estimated from the data. However, under the assumption of
normally distributed features, we can derive an analytical approximation of FIRM in terms
of first order Taylor approximations. More precisely, we use the following approximation.

Approximation. For a normally distributed random variable $\tilde{X} \sim \mathcal{N}(\tilde{\mu}, \tilde{\Sigma})$
and a differentiable function $g : \mathbb{R}^d \to \mathbb{R}^p$, the distribution of $g(\tilde{X})$ is
approximated by its first order Taylor expansion:

$$ g(\tilde{X}) \sim \mathcal{N}\left( g(\tilde{\mu}),\; J \tilde{\Sigma} J^\top \right) \quad \text{with} \quad J = \frac{\partial g}{\partial x}\Big|_{x = \tilde{\mu}} . $$

Note that if the function $g$ is linear, the distribution is exact.

In the course of this subsection, we consider feature functions $f_j(x) = x_j$ (an extension to
linear feature functions $f(x) = x^\top a$ is straightforward).
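First, recall that for a normally distributed random variable $X \sim \mathcal{N}(0, \Sigma)$, the conditional
distribution of $X \mid X_j = t$ is again normal, with expectation

$$ \mathbb{E}[X \mid X_j = t] = \frac{t}{\Sigma_{jj}} \Sigma_{\bullet j} =: \tilde{\mu}_j . $$

Here $\Sigma_{\bullet j}$ is the $j$th column of $\Sigma$.

Now, using the above approximation, the conditional expected score is

$$ q_j(t) \approx s(\tilde{\mu}_j) = s\left( (t / \Sigma_{jj}) \, \Sigma_{\bullet j} \right) . $$

To obtain the FIRM score, we apply the approximation again, this time to the function
$t \mapsto s\left( (t / \Sigma_{jj}) \, \Sigma_{\bullet j} \right)$. Its first derivative at the expected value $t = 0$ equals

$$ J = \frac{1}{\Sigma_{jj}} \Sigma_{\bullet j}^\top \frac{\partial s}{\partial x}\Big|_{x=0} . $$

Since $\mathrm{Var}[X_j] = \Sigma_{jj}$, this yields

$$ Q_j \approx \frac{1}{\sqrt{\Sigma_{jj}}} \Sigma_{\bullet j}^\top \frac{\partial s}{\partial x}\Big|_{x=0} . \tag{9} $$

Note the correspondence to (3) in Friedman's paper [3]: If the features are uncorrelated, (9)
simplifies to

$$ Q_j \approx \sqrt{\Sigma_{jj}} \; \frac{\partial s}{\partial x_j}\Big|_{x_j=0} $$

(recall that $\mathbb{E}[X_j] = 0$ and $\mathrm{Var}[X_j] = \Sigma_{jj}$). Hence FIRM adds an additional weighting
that corresponds to the dependence of the input features. These weightings are based on the true
covariance structure of the predictors. In applications, the true covariance matrix is in general
not known. However, it is possible to estimate it reliably even from high-dimensional data using
mean-squared-error optimal shrinkage [7].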
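
The first-order approximation of this subsection needs only the covariance matrix and the
gradient of the score at the origin. The sketch below evaluates it with a finite-difference
gradient; the score function `s`, the covariance matrix and the helper names are made-up
assumptions, not quantities from this paper.

```python
from math import sqrt, tanh


def gradient_at_zero(s, d, eps=1e-6):
    """Central finite-difference gradient of s at x = 0."""
    grad = []
    for j in range(d):
        plus, minus = [0.0] * d, [0.0] * d
        plus[j], minus[j] = eps, -eps
        grad.append((s(plus) - s(minus)) / (2 * eps))
    return grad


def approximate_firm(s, cov):
    """First-order FIRM: Q_j ~ (column j of cov) . grad s(0) / sqrt(cov[j][j])."""
    d = len(cov)
    grad = gradient_at_zero(s, d)
    return [sum(cov[k][j] * grad[k] for k in range(d)) / sqrt(cov[j][j])
            for j in range(d)]


def s(x):
    """Hypothetical nonlinear score that reads only the first input."""
    return tanh(2.0 * x[0])


# Inputs 0 and 1 are correlated; input 2 is independent of both.
cov = [[1.0, 0.8, 0.0],
       [0.8, 1.0, 0.0],
       [0.0, 0.0, 4.0]]
q = approximate_firm(s, cov)
print(q)
```

Here the second input obtains a non-zero importance (about 1.6) purely through its correlation
with the first one, although the score never reads it, while the independent third input gets
zero.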

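Note that the above approximation can be used to compute FIRM for the kernel based
score functions (2). E.g., for Gaussian kernels

$$ k_\gamma(x, x_i) = \exp\left( - \frac{\| x - x_i \|^2}{\gamma^2} \right) $$

we have

$$ \frac{\partial k_\gamma(x, x_i)}{\partial x}\Big|_{x=0} = \frac{2 k_\gamma(0, x_i)}{\gamma^2} \, x_i^\top = \frac{2 e^{-\| x_i \|^2 / \gamma^2}}{\gamma^2} \, x_i^\top $$

and hence obtain

$$ \frac{\partial s}{\partial x}\Big|_{x=0} = \sum_{i=1}^{n} \alpha_i \, \frac{2 e^{-\| x_i \|^2 / \gamma^2}}{\gamma^2} \, x_i^\top . $$
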
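
The closed-form gradient of a Gaussian kernel expansion at the origin can be checked against
finite differences. The sketch assumes the kernel parametrization
$k_\gamma(x, x_i) = \exp(-\|x - x_i\|^2 / \gamma^2)$; all names and numbers below are hypothetical.

```python
from math import exp


def gauss_kernel(x, z, gamma):
    """Gaussian kernel exp(-||x - z||^2 / gamma^2) (assumed parametrization)."""
    return exp(-sum((a - b) ** 2 for a, b in zip(x, z)) / gamma ** 2)


def score(x, alphas, train_inputs, gamma, b=0.0):
    """Kernel expansion s(x) = sum_i alpha_i k(x_i, x) + b."""
    return sum(a * gauss_kernel(x, x_i, gamma)
               for a, x_i in zip(alphas, train_inputs)) + b


def score_gradient_at_zero(alphas, train_inputs, gamma):
    """Closed form: sum_i alpha_i * 2 k(0, x_i) / gamma^2 * x_i."""
    d = len(train_inputs[0])
    zero = [0.0] * d
    grad = [0.0] * d
    for a, x_i in zip(alphas, train_inputs):
        factor = 2.0 * a * gauss_kernel(zero, x_i, gamma) / gamma ** 2
        for j in range(d):
            grad[j] += factor * x_i[j]
    return grad


def numeric_gradient_at_zero(alphas, train_inputs, gamma, eps=1e-5):
    """Central finite differences, used only as a sanity check."""
    d = len(train_inputs[0])
    grad = []
    for j in range(d):
        plus, minus = [0.0] * d, [0.0] * d
        plus[j], minus[j] = eps, -eps
        grad.append((score(plus, alphas, train_inputs, gamma)
                     - score(minus, alphas, train_inputs, gamma)) / (2 * eps))
    return grad


train_inputs = [(1.0, 0.5), (-0.3, 2.0), (0.8, -1.1)]
alphas = [0.7, -0.4, 1.2]
gamma = 1.5
analytic = score_gradient_at_zero(alphas, train_inputs, gamma)
numeric = numeric_gradient_at_zero(alphas, train_inputs, gamma)
print(analytic, numeric)
```

Multiplying this gradient with the columns of the covariance matrix and dividing by the feature
standard deviations then gives the first-order FIRM of this subsection.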
2.3 Exact FIRM for Binary Data



Binary features are both analytically simple and, due to their interpretability and versatility, 
practically highly relevant. Many discrete features can be adequately represented by binary 
features, even if they can assume more than two values. For example, a categorical feature 
can be cast into a sparse binary encoding with one indicator bit for each value; an ordinal 
feature can be encoded by bits that indicate whether the value is strictly less than each of its 
possibilities. Therefore we now try to understand in more depth how FIRM acts on binary 
variables. 

For a binary feature $f : \mathcal{X} \to \{a, b\}$ with feature values $t \in \{a, b\}$, let the distribution be
described by

$$ p_a = \Pr(f(X) = a), \qquad p_b = 1 - p_a , $$

and let the conditional expectations be $q_a = q_f(a)$ and $q_b = q_f(b)$. Simple algebra shows that
in this case $\mathrm{Var}[q(f(X))] = p_a p_b (q_a - q_b)^2$. Thus we obtain the feature importance

$$ Q_f = (q_a - q_b) \sqrt{p_a p_b} . \tag{10} $$

(By dropping the absolute value around $q_a - q_b$ we retain the directionality of the feature's
impact on the score.) Note that we can interpret FIRM in terms of the slope of a linear function.
If we assume that $a, b \in \mathbb{R}$, the linear regression fit

$$ (w_f, c_f) = \arg\min_{w_f, c_f} \int_{\mathbb{R}} \left( (w_f t + c_f) - q_f(t) \right)^2 d\Pr(t) $$

has the slope $w_f = \frac{q_a - q_b}{a - b}$. The variance of the feature value is
$\mathrm{Var}[f(X)] = p_a p_b (a - b)^2$. (10) is recovered as the increase of the linear regression
function along one standard deviation of the feature value. As desired, the importance is
independent of feature translation and rescaling (provided that the score remains unchanged). In
the following we can thus (without loss of generality) constrain that $t \in \{-1, +1\}$.

Let us reconsider POIMs $Q'$, which are defined in equation (5). Let $b$ denote the presence of the
substring, $X[j] = z$, and write $Q(b) := (q_b - q_a) \sqrt{p_a p_b}$ for the importance (10) oriented
towards the value $b$. We note that $Q'(b) := q_b - \bar{q} = p_a (q_b - q_a) = \sqrt{p_a / p_b} \, Q(b)$;
thus $Q(z, j)$ can be recovered as

$$ Q(z, j) = Q'(z, j) \sqrt{\Pr(X[j] = z) / \Pr(X[j] \neq z)} . $$

Thus, while POIMs are not strictly a special case of FIRM, they differ only in a scaling factor
which depends on the distribution assumption. For a uniform Markov model (as empirically
is sufficient according to [17]), this factor is constant.
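
The identities of this subsection are easy to confirm numerically. The following sketch uses
arbitrary made-up values for the conditional expectations and probabilities; the function names
are hypothetical.

```python
from math import sqrt


def binary_firm(q_a, q_b, p_a):
    """Signed FIRM of a binary feature: (q_a - q_b) * sqrt(p_a * p_b)."""
    return (q_a - q_b) * sqrt(p_a * (1.0 - p_a))


def std_of_q(q_a, q_b, p_a):
    """Standard deviation of q(f(X)) computed from its definition."""
    p_b = 1.0 - p_a
    q_bar = p_a * q_a + p_b * q_b
    return sqrt(p_a * (q_a - q_bar) ** 2 + p_b * (q_b - q_bar) ** 2)


def slope_times_std(q_a, q_b, p_a, a, b):
    """Regression slope through (a, q_a), (b, q_b) times the feature's standard deviation."""
    slope = (q_a - q_b) / (a - b)
    feature_std = sqrt(p_a * (1.0 - p_a)) * abs(a - b)
    return slope * feature_std


def poim_value(q_a, q_b, p_a):
    """POIM-style value for outcome b: q_b minus the unconditional mean."""
    return q_b - (p_a * q_a + (1.0 - p_a) * q_b)


q_a, q_b, p_a = 3.0, 1.0, 0.2   # made-up values
print(binary_firm(q_a, q_b, p_a), std_of_q(q_a, q_b, p_a))
print(slope_times_std(q_a, q_b, p_a, a=5.0, b=-3.0),
      slope_times_std(q_a, q_b, p_a, a=1.0, b=0.0))
```

The slope-based value does not depend on the encoding $(a, b)$ as long as $a > b$, which is the
translation and rescaling invariance stated above. The POIM-style value differs from the
importance oriented towards $b$ only by the factor $\sqrt{p_a / p_b}$.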



2.4 FIRM for Linear Scoring Functions 

To understand the properties of the proposed measure, it is useful to consider it in the case
of linear output functions (1).

2.4.1 Independently Distributed Binary Data

First, let us again consider the simplest scenario of uniform binary inputs,
$X \sim \mathrm{unif}(\{-1, +1\}^d)$; the inputs are thus pairwise independent.

First we evaluate the importance of the input variables as features, i.e. we consider
projections $f_j(x) = x_j$. In this case, we immediately find for the conditional expectation $q_j(t)$ of
the value $t$ of the $j$-th variable that $q_j(t) = t w_j + b$. Plugged into (10) this yields $Q_j = w_j$,
as expected. When the features are independent, their impact on the score is completely
quantified by their associated weights; no side effects have to be taken into account, as no
other features are affected.

We can also compute the importances of conjunctions of two variables, i.e.

$$ f_{j \wedge k}(x) = \mathbb{I}\{x_j = +1 \wedge x_k = +1\} . $$

Here we find that $q_{j \wedge k}(1) = w_j + w_k + b$ and $q_{j \wedge k}(0) = -\frac{1}{3}(w_j + w_k) + b$, with
$\Pr(f_{j \wedge k}(X) = 1) = \frac{1}{4}$. This results in the feature importance
$Q_{j \wedge k} = (w_j + w_k) / \sqrt{3}$. This calculation also applies
to negated variables and is easily extended to higher order conjunctions.

Another interesting type of feature derives from the xor-function. For features
$f_{j \oplus k}(x) = \mathbb{I}\{x_j \neq x_k\}$ the conditional expectations coincide,
$q_{j \oplus k}(0) = q_{j \oplus k}(1) = b$, so that the importance vanishes. Here FIRM
exposes the inability of the linear model to capture such a dependence.
2.4.2 Binary Data With Empirical Distribution 

Here we consider the empirical distribution as given by a set $\{x_i \mid i = 1, \dots, n\}$ of $n$ data
points $x_i \in \{-1, +1\}^d$: $\Pr(X) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{X = x_i\}$. For input features
$f_j(x) = x_j$, this leads to $q_j(t) = \frac{1}{n_{jt}} \sum_{i : x_{ij} = t} w^\top x_i + b$, where
$n_{jt} := |\{ i \mid x_{ij} = t \}|$ counts the examples showing the feature value $t$. With (10) we get

$$ Q_j = \left( q_j(+1) - q_j(-1) \right) \sqrt{\Pr(X_j = +1) \Pr(X_j = -1)}
       = \left( \sum_{i=1}^{n} \frac{x_{ij}}{n_{j, x_{ij}}} \, w^\top x_i \right) \frac{\sqrt{n_{j,+1} \, n_{j,-1}}}{n} . $$

It is convenient to express the vector $Q \in \mathbb{R}^d$ of all feature importances in matrix notation.
Let $X \in \mathbb{R}^{n \times d}$ be the data matrix with the data points $x_i$ as rows. Then we can write

$$ Q = M^\top X w \quad \text{with} \quad M \in \mathbb{R}^{n \times d} = 1_{n \times d} D_0 + X D_1 $$

with diagonal matrices $D_0, D_1 \in \mathbb{R}^{d \times d}$ defined by

$$ (D_1)_{jj} = \frac{1}{2 \sqrt{n_{j,+1} \, n_{j,-1}}} , \qquad
   (D_0)_{jj} = \frac{n_{j,-1} - n_{j,+1}}{2 n \sqrt{n_{j,+1} \, n_{j,-1}}} . \tag{11} $$

With the empirical covariance matrix $\hat{\Sigma} = \frac{1}{n} X^\top X$, we can thus express $Q$ as
$Q = D_0 1_{d \times n} X w + n D_1 \hat{\Sigma} w$. Here it becomes apparent how FIRM, as opposed to
the plain $w$, takes the correlation structure of the features into account. Further, for a
uniformly distributed feature $j$ (i.e. $\Pr(X_j = t) = \frac{1}{2}$), the standard scaling is
reproduced, i.e. $(D_1)_{jj} = \frac{1}{n}$, and the other terms vanish, as $(D_0)_{jj} = 0$.

For $X$ containing each possible feature vector exactly once, corresponding to the uniform
distribution and thus independent features, $M^\top X$ is the identity matrix (the covariance
matrix), recovering the above solution of $Q = w$.
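
As a sanity check of the matrix formulation, the sketch below computes the empirical importances
once directly from the conditional means and once via $Q = M^\top X w$, with the diagonal entries
chosen such that $M_{ij} = (D_0)_{jj} + x_{ij} (D_1)_{jj}$ reproduces the direct formula. The toy
data set and the function names are assumptions; every feature must take both values in the
sample, otherwise the conditional means are undefined.

```python
from math import sqrt


def firm_direct(X, w, b=0.0):
    """(q_j(+1) - q_j(-1)) * sqrt(n_plus * n_minus) / n for every input feature j."""
    n = len(X)
    scores = [sum(wk * xk for wk, xk in zip(w, x)) + b for x in X]
    Q = []
    for j in range(len(w)):
        pos = [s for s, x in zip(scores, X) if x[j] == 1]
        neg = [s for s, x in zip(scores, X) if x[j] == -1]
        q_diff = sum(pos) / len(pos) - sum(neg) / len(neg)
        Q.append(q_diff * sqrt(len(pos) * len(neg)) / n)
    return Q


def firm_matrix(X, w):
    """Q = M^T X w with M[i][j] = D0[j] + X[i][j] * D1[j]."""
    n = len(X)
    Xw = [sum(wk * xk for wk, xk in zip(w, x)) for x in X]
    Q = []
    for j in range(len(w)):
        n_pos = sum(1 for x in X if x[j] == 1)
        n_neg = n - n_pos
        root = sqrt(n_pos * n_neg)
        d1 = 1.0 / (2.0 * root)
        d0 = (n_neg - n_pos) / (2.0 * n * root)
        Q.append(sum((d0 + x[j] * d1) * xw for x, xw in zip(X, Xw)))
    return Q


# Hypothetical sample of five points in {-1, +1}^3 and a hypothetical weight vector.
X = [(1, 1, -1), (1, -1, -1), (-1, 1, 1), (1, 1, 1), (-1, -1, 1)]
w = (1.0, -2.0, 0.5)
print(firm_direct(X, w, b=0.3))
print(firm_matrix(X, w))
```

Both routes give the same vector; the offset $b$ drops out because it cancels in the difference
of the two conditional means.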

2.4.3 Continuous Data With Normal Distribution 

If we consider normally distributed input features and assume a linear scoring function (1),
the approximations above (Section 2.2) are exact. Hence, the expected conditional score of
an input variable is

$$ q_j(t) = \frac{t}{\Sigma_{jj}} w^\top \Sigma_{\bullet j} + b . \tag{12} $$

With the diagonal matrix $D$ of standard deviations of the features, i.e. with entries
$D_{jj} = \sqrt{\Sigma_{jj}}$, this is summarized in

$$ q = b 1_d + t D^{-2} \Sigma w . $$

Exploiting that the marginal distribution of $X$ with respect to the $j$-th variable is again a
zero-mean normal, $X_j \sim \mathcal{N}(0, \Sigma_{jj})$, this yields $Q = D^{-1} \Sigma w$. For uncorrelated
features, $D$ is the square root of the diagonal covariance matrix so that we get $Q = D w$. Thus
a rescaling of a feature (with the weight adjusted so that the score is unchanged) is compensated
by the corresponding rescaling of its standard deviation; unlike the plain weights, FIRM cannot
be manipulated this way.

As FIRM weights the scoring vector by the correlation $D^{-1} \Sigma$ between the variables, it is
in general more stable and more reliable than the information obtained by the scoring vector
alone. As an extreme case, let us consider a two-dimensional variable $(X_1, X_2)$ with unit
variances and almost perfect correlation $\rho = \mathrm{cor}(X_1, X_2) \approx 1$. In this situation,
$\ell_1$-type methods like lasso tend to select randomly only one of these variables, say
$w = (w_1, 0)$, while $\ell_2$-regularization tends to give almost equal weights to both variables.
FIRM compensates for the arbitrariness of lasso by considering the correlation structure of $X$:
in this case $Q = (w_1, \rho w_1)$, which is similar to what would be found for an equal weighting
$w = \frac{1}{2}(w_1, w_1)$, namely $Q = (w_1 (1 + \rho)/2, \; w_1 (1 + \rho)/2)$.

Linear Regression. Here we assume that the scoring function $s$ is the solution of an
unregularized linear regression problem, $\min_{w,b} \| X w - y \|^2$; thus $w = (X^\top X)^{-1} X^\top y$.
Plugging this into the expression for $Q$ from above yields

$$ Q = D^{-1} \Sigma \, (n \hat{\Sigma})^{-1} X^\top y , \tag{13} $$

where $\hat{\Sigma} = \frac{1}{n} X^\top X$ is the empirical covariance matrix. For infinite training data,
$\hat{\Sigma} \to \Sigma$, we thus obtain $Q = \frac{1}{n} D^{-1} X^\top y$. Here it becomes
apparent how the normalization makes sense: it renders the importance independent of a
rescaling of the features. When a feature is inflated by a factor, so is its standard deviation
$D_{jj}$, and the two effects cancel in the product $D^{-1} X^\top y$.
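
The closed form $Q = D^{-1} \Sigma w$ for a linear score on zero-mean normal inputs takes only a few
lines to evaluate. The sketch below replays the two-variable example with an assumed correlation
of 0.95 and also checks the invariance under feature rescaling; all numbers and names are
illustrative assumptions.

```python
from math import sqrt


def firm_linear_gaussian(cov, w):
    """Q_j = (row j of cov) . w / sqrt(cov[j][j]) for a linear score on normal inputs."""
    d = len(w)
    return [sum(cov[j][k] * w[k] for k in range(d)) / sqrt(cov[j][j])
            for j in range(d)]


rho = 0.95
cov = [[1.0, rho],
       [rho, 1.0]]

q_sparse = firm_linear_gaussian(cov, [1.0, 0.0])   # lasso-like: one variable selected
q_dense = firm_linear_gaussian(cov, [0.5, 0.5])    # ridge-like: weight shared equally

# Inflate feature 0 by c and shrink its weight by 1/c, so the score is unchanged.
c = 10.0
cov_scaled = [[c * c * 1.0, c * rho],
              [c * rho, 1.0]]
q_rescaled = firm_linear_gaussian(cov_scaled, [1.0 / c, 0.0])

print(q_sparse, q_dense, q_rescaled)
```

The sparse and the dense weight vectors lead to similar importances, roughly $(1, 0.95)$ versus
$(0.975, 0.975)$, and the rescaled model reproduces the importances of the original sparse model.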

3 Simulation Studies 

We now illustrate the usefulness of FIRM in a few preliminary computational experiments on 
artificial data. 

3.1 Binary Data 

We consider the problem of learning the Boolean formula $x_1 \vee (\neg x_1 \wedge \neg x_2)$. An SVM with
polynomial kernel of degree 2 is trained on all 8 samples that can be drawn from the Boolean
truth table for the variables $(x_1, x_2, x_3) \in \{0, 1\}^3$. Afterwards, we compute FIRM both based
on the trained SVM ($w$) and based on the true labelings ($y$). The results are displayed in
Figure 1.
Figure 1: FIRMs and SVM-$w$ for the Boolean formula $x_1 \vee (\neg x_1 \wedge \neg x_2)$. The figures display
heat maps of the scores; blue denotes negative label, red positive label, white is neutral. The
upper row of heat maps shows the scores assigned to a single variable, the lower row shows the
scores assigned to pairs of variables. The first column shows the SVM-$w$ assigning a weight to
the monomials $x_1, x_2, x_3$ and $x_1 x_2, x_1 x_3, x_2 x_3$, respectively. The second column shows FIRMs
obtained from the trained SVM classifier. The third column shows FIRMs obtained from the
true labeling.



Note that the raw SVM $w$ can assign non-zero weights only to feature space dimensions
(here, input variables and their pairwise conjunctions, corresponding to the quadratic kernel);
all other features, here for example pairwise disjunctions, are implicitly assigned zero. The
SVM assigns the biggest weight to $x_2$, followed by $x_1 \wedge x_2$. In contrast, for the SVM-based
FIRM the most important features are $x_1 \wedge \neg x_2$ followed by $\neg x_{1/2}$, which more closely
resembles the truth. Note that, due to the low degree of the polynomial kernel, the SVM is not
capable of learning the function "by heart"; in other words, we have an underfitting situation.
In fact, we have $s(x) = 1.6$ for $(x_1, x_2) = (0, 1)$.

The difference in $y$-FIRM and SVM-FIRM underlines that, as intended, FIRM helps
to understand the learner, rather than the problem. Nevertheless a quite good approximation
to the truth is found as displayed by FIRM on the true labels, for which all seven 2-tuples
that lead to true output are found (black blocks) and only $\neg x_1 \wedge x_2$ leads to a false value
(stronger score). Values where $\neg x_1$ and $x_2$ are combined with $x_3$ lead to a slightly negative
value.



3.2 Gaussian Data 

Here, we analyze a toy example to illustrate FIRM for real-valued data. We consider the
case of binary classification in three real-valued dimensions. The first two dimensions carry
the discriminative information (cf. Figure 2), while the third only contains random noise.
The second dimension contains most discriminative information and we can use FIRM to
recover this fact. To do so, we train a linear SVM classifier to obtain a classification function
$s(x)$. Now we use the linear regression approach to model the conditional expected scores
$q_i$ (see Figure 2b-d for the three dimensions). We observe that dimension two indeed shows
the strongest slope indicating the strongest discriminative power, while the third (noise)
dimension is identified as uninformative.




Figure 2: Binary classification performed on continuous data that consists of two 3d Gaussians
constituting the two classes (with $x_3$ being pure noise). From left to right: a) Of the raw data
set, $x_1, x_2$ are displayed. b) Score of the linear discrimination function $s(x_i)$ (blue) and
conditional expected score $q_1((x_i)_1)$ (red) for the first dimension of $x$. c) $s(x_i)$ and $q_2((x_i)_2)$
for varying $x_2$. As the variance of $q$ is highest here, this is the discriminating dimension
(closely resembling the truth). d) $s(x_i)$ and $q_3((x_i)_3)$ for varying $x_3$. Note that $x_3$ is the noise
dimension and does not contain discriminating information (as can be seen from the small
slope of $q_3$).
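
The slope-based estimate used in this experiment can be sketched as follows. The code does not
reproduce the experiment: instead of a trained SVM it uses a fixed, hypothetical linear score,
and the inputs are independent standard normal samples drawn with a fixed seed. The importance of
a feature is taken as the regression slope of the score on the feature times the feature's
standard deviation.

```python
import random
from math import sqrt


def slope_firm(feature_values, scores):
    """Slope of the least-squares fit of score on feature, times the feature's std."""
    n = len(scores)
    mean_f = sum(feature_values) / n
    mean_s = sum(scores) / n
    cov_fs = sum((f - mean_f) * (s - mean_s)
                 for f, s in zip(feature_values, scores)) / n
    var_f = sum((f - mean_f) ** 2 for f in feature_values) / n
    return (cov_fs / var_f) * sqrt(var_f)


rng = random.Random(0)
# Hypothetical stand-in for a trained linear classifier: the second dimension
# is most discriminative, the third one is ignored (pure noise).
w = (0.5, 2.0, 0.0)
X = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(2000)]
scores = [sum(wj * xj for wj, xj in zip(w, x)) for x in X]

Q = [slope_firm([x[j] for x in X], scores) for j in range(3)]
print(Q)
```

The second dimension receives the largest importance and the noise dimension a value close to
zero, mirroring the qualitative picture described for Figure 2.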



3.3 Sequence Data 



As shown above (Section 1.2), for sequence data FIRM is essentially identical to the previously
published technique POIMs [17]. To illustrate its power for sequence classification, we use a
toy data set from [9]: random DNA sequences are generated, and for the positive class the 
sub-sequence GATTACA is planted at a random position centered around 35 (rounded normal 
distribution with SD=7). As biological motifs are typically not perfectly conserved, the 
planted consensus sequences are also mutated: for each planted motif, a single position is 
randomly chosen, and the incident letter replaced by a random letter (allowing for no change 
for ~ 25% of cases). An SVM with WDS kernel ^ is trained on 2500 positive and as many 
negative examples. 

Two analyses of feature importance are presented in Figure 3: one based on the feature
weights $w$ (left), the other on the feature importance $Q$ (right). It is apparent that FIRM
identifies the GATTACA feature as being most important at positions between 20 and 50, and it
even attests significant importance to the strings with edit distance 1. The feature weighting
$w$, on the other hand, fails completely: sequences with one or two mutations receive random
importance, and even the importance of the consensus GATTACA itself shows erratic behavior.

The reason is that the appearance of the exact consensus sequence is not a reliable feature,
as it mostly occurs mutated. More useful features are substrings of the consensus, as they
are less likely to be hit by a mutation. Consequently there is a large number of such features
that are given high weight by the SVM. By taking into account the correlation of such
short substrings with longer ones, in particular with GATTACA, FIRM can recover the "ideal"
feature which yields the highest SVM score. Note that this "intelligent" behavior arises
automatically; no more domain knowledge than the Markov distribution (and it is only 0-th
order uniform!) is required. The practical value of POIMs for real world biological problems
has been demonstrated in [17].

Figure 3: Feature importance analyses based on (left) the SVM feature weighting $w$ and (right)
FIRM. The shaded area shows the ±1 SD range of the importance of completely irrelevant features
(length 7 sequences that disagree with GATTACA at every position). The red lines indicate the
positional importances of the exact motif GATTACA; the magenta and blue lines represent average
importances of all length 7 sequences with edit distances 1 and 2, respectively, to GATTACA.
While the feature weighting approach cannot distinguish the decisive motif from random
sequences, FIRM identifies it confidently.

4 Summary and Conclusions 

We propose a new measure that quantifies the relevance of features. We take up the idea
underlying a recent sequence analysis method (called POIMs, [17]), namely to assess the
importance of substrings by their impact on the expected score, and generalize it to arbitrary
continuous features. The resulting feature importance ranking measure FIRM has invariance
properties that are highly desirable for a feature ranking measure. First, it is "objective": it is
invariant with respect to translation, and reasonably invariant with respect to rescaling of the
features. Second, to our knowledge FIRM is the first feature ranking measure that is totally
"universal", i.e. which allows for evaluating any feature, irrespective of the features used in the
primary learning machine. It also imposes no restrictions on the learning method. Most
importantly, FIRM is "intelligent": it can identify features that are not explicitly represented in
the learning machine, due to the correlation structure of the feature space. This allows one, for
instance, to identify sequence motifs that are longer than the considered substrings, or that
are not even present in a single training example.

By definition, FIRM depends on the distribution of the input features, which is in general
not available. We showed that under various scenarios (e.g. binary features, normally
distributed features), we can obtain approximations of FIRM that can be efficiently computed
from data. In real-world scenarios, the underlying assumptions might not always be fulfilled.
Nevertheless, e.g. with respect to the normal distribution, we can still interpret the derived
formulas as an estimation based on first and second order statistics only.
While the quality of the computed importances does depend on the accuracy of the trained
learning machine, FIRM can be used with any learning framework. It can even be used
without a prior learning step, on the raw training data. Usually, feeding training labels as
scores into FIRM will yield similar results as using a learned function; this is natural, as both
are supposed to be highly correlated.

However, the proposed indirect procedure may improve the results due to three effects:
first, it may smooth away label errors; second, it extends the set of labeled data from the
sample to the entire space; and third, it allows one to explicitly control and utilize distributional
information, which may not be as pronounced in the training sample. A deeper understanding
of such effects, and possibly their exploitation in other contexts, seems to be a rewarding field
of future research.

Based on the unique combination of desirable properties of FIRM, and the empirical
success of its special case for sequences, POIMs [17], we anticipate FIRM to be a valuable
tool for gaining insights where alternative techniques struggle.